

Poster: Nemo_bis Date: Jul 7, 2014 10:37am
Forum: faqs Subject: Retroactive robots.txt removal of past crawls AKA Oakland Archive Policy

There have been countless discussions of this, unsurprisingly, but as a simple volunteer I'm surprised at how little constructive criticism they contained.

The matter is well known:
https://archive.org/about/faqs.php#2
https://archive.org/about/exclude.php
http://www2.sims.berkeley.edu/research/conferences/aps/removal-policy.html

The Internet Archive doesn't run for free; it has huge costs. Surprisingly low for the level of service it provides, but still huge. When you ask for more access, have you first asked yourself whether *you* would pay for the additional legal costs it may cause?

Shouldn't we instead be happy that resources have been invested in removing the six-month embargo and in allowing on-demand archival of URLs, so that now we can immediately enjoy crawls *and* request our own?

Until the Oakland Archive Policy is superseded, the Internet Archive is not going to change its policies. Is there an alternative standard that one could adopt? If not, who's going to create one? Probably netpreserve.org and IFLA would need to be involved, at least.

If you don't like the current policy, work to create one that serves the public better while providing a legal defense strong enough to safeguard the Internet Archive...

Some more links for further reading:
https://archive.org/post/407088/honoring-present-instead-of-past-robotstxt-is-illogical
https://archive.org/post/1009682/archived-pages-should-be-unaffected-by-robotstxt-changes
https://archive.org/post/1001794/retroactive-and-permanent
https://archive.org/post/433848/domain-resellers-blocking-waybackmachine
https://archive.org/post/225623/retroactive-robotstxt
https://archive.org/post/188806/retroactive-robotstxt-and-domain-squatters
https://archive.org/post/184024/robotstxt-policy-is-a-failure
https://archive.org/post/62230/retroactive-robotstxt-exclusion-different-domain-owner
https://archive.org/post/8920/cybersquatters-copyright-ownership
https://archive.org/post/602721/remove-archived-webpages-when-domain-was-in-hands-of-previous-owner
https://archive.org/post/557165/will-past-crawls-stay-removed-after-removing-robotstxt
https://archive.org/post/423432/domainsponsorcom-erasing-prior-archived-copies-of-135000-domains
https://archive.org/post/401162/parked-domains-robotstxt-disallows-viewing-of-past-content
https://archive.org/post/406315/archived-sites-being-made-no-longer-available-due-to-current-robotstxt
https://archive.org/post/280486/domain-name-re-sold-robots-problem

Reply

Poster: metaeducation Date: Mar 24, 2016 11:23am
Forum: faqs Subject: Re: Retroactive robots.txt removal of past crawls AKA Oakland Archive Policy

> When you ask for more access, have you first asked
> yourself whether *you* would pay for the additional
> legal costs it may cause?

There are various entities I'd hope would be willing to get in the fight if someone were to sue (the EFF, to name one).

Either way, it would seem there should be a way to irrevocably greenlight the Internet Archive on content. A license on the content can already do this.

For instance a Creative Commons license: if my blog is entirely CC-BY-SA content, then shouldn't the archive be able to keep it up regardless of some hypothetical later state of robots.txt? There could be something more selective, an "Internet Archive License", so that even otherwise copyrighted sites could greenlight the archive keeping a copy.

If it has to be an opt-in process, then that's unfortunate. But I'd certainly prefer to be able to "opt-in to future domain squatters not being able to erase my existence" over having no choice at all...

Reply

Poster: Hjulle Date: Mar 4, 2015 12:50am
Forum: faqs Subject: Re: Retroactive robots.txt removal of past crawls AKA Oakland Archive Policy

None of the documents you refer to says that a new owner should be allowed to remove the old owner's content from the Internet Archive. Allowing that does not make any sense, but it's still the way it works right now.

This will also become a growing problem as more and more webmasters die (or otherwise become unable to pay for their domains). If a domain switches owner, the new owner should not have any power over the old owner's content.

Reply

Poster: Hjulle Date: Mar 4, 2015 12:54am
Forum: faqs Subject: Re: Retroactive robots.txt removal of past crawls AKA Oakland Archive Policy

This page http://www2.sims.berkeley.edu/research/conferences/aps/removal-policy.html doesn't even say that the bot obeys "User-agent: *" (it requires "User-agent: ia_archiver"). So that is a second way in which the current behavior is more restrictive than the Oakland Archive Policy requires.

A reasonable compromise would be to make "User-agent: *" affect only the current version, and make "User-agent: ia_archiver" retroactive. That way, you wouldn't remove history by mistake, but you could still remove it just as easily, and no policy documents would have to change.
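Under that compromise, the two cases would look like this in robots.txt (a sketch of the proposal, not of current IA behavior):

```
# Blocks future crawling by every robot, including ia_archiver,
# but under the proposed compromise leaves past snapshots alone:
User-agent: *
Disallow: /

# Explicit, retroactive removal from the Wayback Machine, using the
# user-agent that the removal policy page actually documents:
User-agent: ia_archiver
Disallow: /
```

A parked domain's boilerplate "User-agent: *" block would then stop new crawls without erasing the previous owner's history.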

Also note that "The Robot Exclusion Standard does not mention anything about the "*" character in the Disallow: statement." - https://en.wikipedia.org/wiki/Robots_exclusion_standard#Universal_.22.2A.22_match

Reply

Poster: CogDogBlog Date: Jun 28, 2016 11:55am
Forum: faqs Subject: Re: Retroactive robots.txt removal of past crawls AKA Oakland Archive Policy

Fourteen years' worth of my early web work in education (1993-2006) has vanished from the archive, reportedly because of robots.txt. However, it's not an inclusion or exclusion problem: some IT person mangled a DNS forwarding entry, and the domain for the archive no longer resolves to anything.

So if robots.txt is not found at all, the IA wipes it out? Hardly archival, to my simple mind. The full story: http://cogdogblog.com/2016/06/dont-archive/

Reply

Poster: Nemo_bis Date: Mar 4, 2015 1:33am
Forum: faqs Subject: Re: Retroactive robots.txt removal of past crawls AKA Oakland Archive Policy

Your interpretation that "*" does not (or should not) imply "ia_archiver" for the purposes of the Oakland Archive Policy is an interesting one, but let me say it's a bit adventurous. It might be a way out legally speaking, but it's not self-evident.

Just think of all the emails or support requests that might come from webmasters confused by the (non-)interpretation of "*": increasing the workload like that would defeat the purpose. I can understand why the IA prefers a conservative (customary?) interpretation for now, and I trust them to switch to a less defensive interpretation whenever that becomes more sustainable than the opposite.

Reply

Poster: Menelmacar Date: Apr 2, 2015 4:02pm
Forum: faqs Subject: Re: Retroactive robots.txt removal of past crawls AKA Oakland Archive Policy

"Just think of all the emails or support requests which might com from webmasters confused by the (non) interpretation of "*": increasing the workload like that would defeat the purpose. I can understand why IA prefers a conservative (customary?) interpretation for now and I trust them to switch to a less defensive interpretation whenever that's more sustainable than the opposite."

That's the thing: there's nothing customary about it. The robots.txt standard was invented to affect the *current* behavior of crawlers. Stopping or limiting current crawling is all it was ever drafted to do. As far as I've seen, it was never proposed that compliant robots would be expected to perform actions elsewhere, such as modifying existing databases.

See:
http://www.robotstxt.org/orig.html
http://www.robotstxt.org/norobots-rfc.txt
http://en.wikipedia.org/wiki/Robots.txt
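That fetch-time-only scope is visible even in Python's standard library parser, which models robots.txt purely as an answer to "may I fetch this URL right now?" (a minimal sketch; the example.com URLs are placeholders):

```python
from urllib import robotparser

# A typical robots.txt: one rule group applying to every robot.
rp = robotparser.RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /private/",
])

# The only question the protocol answers is whether to crawl *now*;
# nothing in it describes touching previously collected data.
print(rp.can_fetch("ia_archiver", "http://example.com/private/page"))  # False
print(rp.can_fetch("ia_archiver", "http://example.com/public/page"))   # True
```

There is no verb in the protocol for "delete what you already hold"; reading retroactive removal into "Disallow" is an extension the original documents never describe.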

The "Oakland Archive Policy" that IA defers to ( http://www2.sims.berkeley.edu/research/conferences/aps/removal-policy.html ) tries to use robots.txt for a purpose it was never designed for. It's a Band-Aid for the fact that there never was (and likely never will be, given the legal tangles involved) a dedicated mechanism for sites to declare whether it's ok for archiving sites to retain permanent copies.

For its part, robots.txt was never even approved as a standard by a major standards body. It's only a de facto one, which one would think (note: IANAL) might make its use in a legal context even more problematic.

It's unfortunate that no legal protection has (to my knowledge) been enshrined for cases where Internet archiving is provided to the public in an essentially unmodified form for no profit, similar to what exists for temporary caching ( http://en.wikipedia.org/wiki/Online_Copyright_Infringement_Liability_Limitation_Act#Other_safe_harbor_provisions ). Given the immense value of a resource like the IA to society, ideally something would be worked out to put a site like the IA on safer footing.

I think the long and short of the problem is that the IA doesn't have the legal staff, legislated liability protection, or access to standardized authorization protocols that would put it on safer legal ground, nor enough staff to handle enormous volumes of takedown requests, so it feels it has to go to enormous lengths to be cautious.

I do wish they could at least correlate robots.txt changes against whois records, though. My heart sinks every time this happens. It'll definitely become a worse and worse problem as time goes on.

*Sigh* One more reason to loathe %*&^$*ing domain squatting. (Sorry, "domain parking". Ugh.)

Reply

Poster: Nemo_bis Date: Apr 2, 2015 11:32pm
Forum: faqs Subject: Re: Customary syntax and liability

As for "customary", I meant *only* the usage of "*" as a wildcard.

As for legal protection, you're very right. I wonder if https://www.manilaprinciples.org/ would help.

Reply

Poster: Hjulle Date: Mar 4, 2015 1:46am
Forum: faqs Subject: Re: Retroactive robots.txt removal of past crawls AKA Oakland Archive Policy

It is not conservative to retroactively remove all content based on a "User-agent: *". No document mentions that as a valid operation in robots.txt (the documented syntax for removing the archive is very explicit). People expect robots not to crawl sites with "User-agent: *"; they wouldn't expect them to delete the archive as well.

But according to https://archive.org/post/423432/domainsponsorcom-erasing-prior-archived-copies-of-135000-domains
they already do that. Only "User-agent: ia_archiver" should remove anything, so my point above was moot.

I drew my first conclusion from the fact that this site https://web.archive.org/web/*/http://www.testblogpleaseignore.com/2012/06/22/the-trouble-with-frp-and-laziness/ has no archive at all, while the (new) robots.txt only says "User-agent: *".